Rethinking Feature Generalization in Vacant Space Detection

Vacant space detection is critical in modern parking lots. However, deploying a detection model as a service is not an easy task. Because the camera in a new parking lot is set up at a different height or viewing angle from the original parking lot where the training data were collected, the performance of the vacant space detector can degrade. Therefore, in this paper, we propose a method to learn generalized features so that the detector works better in different environments. In detail, the features are suitable for the vacant space detection task and robust to environmental change. We use a reparameterization process to model the variance from the environment. In addition, a variational information bottleneck is used to ensure that the learned features focus only on the appearance of a car in a specific parking space. Experimental results show that performance on a new parking lot increases significantly even when only data from the source parking lot are used in the training phase.


Introduction
Vacant space detection systems can help drivers quickly and easily find available parking spaces, reducing the time and frustration associated with circling a lot looking for a space. This can also help to reduce congestion in the parking lot, improve traffic flow, and reduce emissions from idling vehicles. Additionally, parking lot managers can use the data collected by these systems to optimize parking lot layouts and improve overall efficiency.
Recently, many sensors have been available for use in vacant space detection. For instance, ultrasonic sensors [1,2] use sound waves to detect the presence or absence of a vehicle in a parking space. They are typically installed above each parking space and can accurately detect the distance to the nearest object such as a vehicle. Magnetic sensors [3,4] use a magnetic field to detect the presence or absence of a vehicle in a parking space. They are typically installed beneath the surface of the parking space and can detect changes in the magnetic field caused by the presence of a vehicle. Infrared sensors [5] use infrared light to detect the presence or absence of a vehicle in a parking space. They are typically installed above each parking space and can detect changes in the infrared light caused by the presence of a vehicle. Video cameras [6][7][8][9][10][11] can be used to detect the presence or absence of a vehicle in a parking space. They are typically installed above each parking space and use image analysis software to detect changes in the video feed caused by the presence of a vehicle.
Among these possible sensors, many research works focus on vision-based systems to detect vacant spaces because a single camera can cover multiple parking slots. Several challenges must be addressed to obtain a workable vacant space detector, such as mutual occlusion patterns, shadows, layout, and size variances. According to the literature, supervised learning methods (Vu 2019 [12] and Huang 2017 [13]) require massive labeled data to reach a high accuracy of 99.74%. However, the domain gap could significantly reduce accuracy when the model is deployed in a new parking lot. As shown in Figure 1, conventional methods may make a wrong prediction in a new testing domain, while our method can learn generalized features that work robustly in a new testing domain.

Figure 1. We set up two cameras to capture the same parking lot but at different angle views. These angle views represent the source and target domains. Samples are collected as normalized three-parking-slot images, as in Vu 2019 [12]. Training and evaluating datasets are collected on the source domain, and a testing dataset is collected on the target domain. The first row is the result of the proposed method; the second row is the conventional method's result.
Transfer learning [14] and unsupervised domain adaptation [15] are popular solutions to relieve the above-mentioned problem. Transfer learning [14] uses a pre-trained model to initialize the weight of the target network. As the pre-trained model is trained based on a vast dataset, its parameters could extract semantic information. Hence, the target model could converge faster with fewer training data. In the vacant space application, the pre-trained model could be trained based on data from a source parking lot; and the target network is a detector deployed in a new parking lot. Given a powerful pre-trained model, transfer learning requires enough newly labeled samples to fine-tune the target network to ensure success.
On the other hand, unsupervised domain adaptation [15] requires labeled data on a source domain and unlabeled data on a target domain. An adversarial loss [16] is used to train features shared between the two domains, and the labeled data in the source domain are used to train a classifier on top of the shared features. To ensure the success of domain adaptation, the learned shared features should be good enough for the specific task.
Both transfer learning and unsupervised domain adaptation rely on a pre-trained feature extractor trained on the source dataset. These features should be highly relevant to vacant states and invariant to environmental changes. Since different camera views create different occlusion patterns, learning features that remain invariant on an unseen domain (the target domain) is challenging. To address the challenge, this paper tries to learn features that represent only the objects (parked cars) and not the background, even when the background includes parts of neighboring cars. We use mutual information [17] to measure the amount of information shared between features and objects as well as between features and backgrounds; the network is trained with an additional variational information bottleneck (VIB) loss.
The idea behind the proposed method is quite similar to the concept in TCL2021 [18] regarding feature learning. TCL2021 [18] uses a trajectory with sequential frames to train the detector. The frames are selected before and after a car moves in/out of a slot; hence, they form pairwise vacant and occupied samples that share the same background but have different vacant states. As shown in Figure 2, the pairwise samples in a trajectory have similar appearances; the only difference between the samples is at the pixels where the car is parked. These paired data function as a contrastive loss where two samples in a trajectory are forced to move far from each other; hence, the model focuses on the necessary pixels and eliminates all background pixels. While many promising results have been reported in TCL2021 [18], this method relies on a motion classifier to prepare training trajectories. Moreover, the training process needs a task-consistency reward to train the vacant-space detector in a reinforcement manner. This makes the training process more complex and requires powerful hardware to store all frames in a trajectory for every iteration update.

Figure 2. Normalization samples [12] on source and target domains. They are collected at the same slot and belong to a trajectory [18]. Each row represents samples at the same slot but from different domains. Each column represents samples in a trajectory where the background is similar, but the differences are from the object at the middle slot.
Unlike the complex training process in TCL2021 [18], our work learns crucial features using a simple process. Here, the model includes a feature extractor and a classifier. Given an input image $x$, the extractor extracts the feature $\mu \in \mathbb{R}^d$ and the corresponding standard deviation $\sigma \in \mathbb{R}^d$. As the classifier must work well with varying features, we use a reparameterization trick to add some uncertainty to the training process. In detail, a latent feature $z$ is sampled from the distribution $z \sim \mathcal{N}(\mu, \sigma^2)$; then, the vacant state $y$ is predicted from the sampled feature $z$ by $y = \mathrm{cls}(z; \theta_{cls})$.
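The sampling step described above can be illustrated with a minimal sketch. It assumes a diagonal Gaussian and plain Python lists; the function name `reparameterize` and the example values are hypothetical and not taken from the paper.

```python
import random


def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as z = mu + sigma * eps with eps ~ N(0, I)."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length d")
    # The randomness lives only in eps, so z stays a deterministic
    # (differentiable) function of mu and sigma.
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


rng = random.Random(0)        # fixed seed, for reproducibility of the sketch only
mu = [0.5, -1.0, 2.0]         # hypothetical encoder mean (d = 3)
sigma = [0.1, 0.0, 1.0]       # hypothetical encoder standard deviation
z = reparameterize(mu, sigma, rng)
```

A dimension with zero standard deviation reproduces its mean exactly, while larger values of sigma expose the classifier to more varied latent features during training.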
A good feature $z$ must represent occlusion patterns that predict vacant states well. Hence, the mutual information $I(y; z)$ should be maximized. Additionally, the background information from the input $x$ should be eliminated from $z$. Hence, the mutual information $I(x; z)$ should be small. The two constraints are optimized together in a variational information bottleneck (VIB) loss [19]. Since this loss can be integrated into any supervised learning framework, the training process becomes simpler than task consistency learning in TCL2021 [18].
In short, the contributions of this paper are the following:
• We propose a learning method to learn better features for vacant parking space detection. The trained model can work better on unseen domains, while no data from the unseen domain are available in the training phase;
• A reparameterization trick is used to learn a classifier that is well adapted to environmental changes;
• A variational information bottleneck loss is used to learn features focused on occlusion patterns and eliminate the background.

Related Works
Vacant space detection for vehicle parking has been an interesting research topic for many years. Early research (Lin 1997 [6], Raja 1999 [20], Raja 2000 [7], and Raja 2001 [21]) relied on feature engineering processes to detect a car in a specific position. Given a selected location, Lin 1997 [6] introduces geometrical models with a spatial contour matching approach and a careful tuning process to fit the specific scenario well. Raja 1999 [20] and Raja 2001 [21] proposed a learning-based approach to model the unknown distribution of images that contain vehicles by utilizing higher-order statistics (HOS) information as the feature. Raja 2000 [7] advanced to detecting and tracking vehicles in videos by combining statistical knowledge about the visual appearance of vehicles with their motion information. However, most of these works focus on a few slots and do not use a single camera to manage all slots in a wide parking area. This may limit particular applications.
With the development of CCTV systems, applying vehicle detection algorithms in the parking lot for vacant space detection has become possible in the recent decade. Therefore, much research has been conducted to pursue desirable performance and overcome challenging issues in practice (Paolo 2013 [22], Lixia 2012 [23], Wu 2007 [24], Huang 2010 [25], and Huang 2013 [26]). Here, images captured by CCTV cameras may include many slots. A detector must address both localization and classification tasks. Since a car should be parked in a specific 3D slot, prior 3D information could be used to address the localization task. In Huang 2010 [25], geometry and image information are fused to generate the vacant states of all slots. Additionally, neighbor slots help to correct the prediction at a query slot via Bayesian inference. Here, six 3D surfaces represent the 3D cube, and each surface is projected to a 2D image to find a corresponding region. Each region is processed independently, and a hierarchical Bayesian inference fuses the predictions from these planes into a final vacant state. The Bayesian hierarchical structure can model the occlusion relationship among all neighboring slots to improve accuracy. However, the inference time of Huang 2010 [25] is too long, and the computing cost needs to be lowered to fit real-world applications. Later, in Huang 2013 [26], the Bayesian hierarchy is replaced by a multi-layer inference framework to learn the correction process. While Huang 2013 [26] presented multiple processing steps, Huang 2015 [27] models each processing step as a layer of a single model. This makes the solution of Huang 2015 [27] simpler than the Bayesian hierarchy of Huang 2013 [26]. Although considerable results are achieved, these methods require experts to tune hyper-parameters manually for each inference layer.
With the rapid development of deep learning techniques, many applications are solved using deep models. The multi-layer framework in Huang 2015 [27] could be replaced by a single deep learning model in Huang 2017 [13]. Each layer in Huang 2015 [27] is modeled by a specific block; later, the network is trained end-to-end. Following the experimental result in Huang 2017 [13], a detector may make a wrong prediction if a car is too big or is not parked at the center of a slot. Therefore, in Huang 2017 [13], a spatial transform network is used to select a suitable region of the input image for classification. To ensure the input image covers all necessary information so that the spatial transform can select a suitable ROI, the authors of Huang 2017 [13] used a normalized three-parking-space image as the input and gained higher accuracy in real time. Next, Vu 2019 [12] provides a contrastive loss to learn a better feature representation. Using a deep learning framework, the methods in Huang 2017 [13] and Vu 2019 [12] perform better than traditional approaches without fancy feature engineering processes or heavy hyper-parameter adjustment. To deploy the vacant detector in a new parking lot, Zeng 2022 [28] further generalized the deep learning-based approaches to significantly different lighting scenarios with the adversarial domain adaptation technique. Both Vu 2019 [12] and Zeng 2022 [28] reused the normalized three-parking-space image setting in Huang 2017 [13] and obtained high-accuracy results.
Although promising results were reported in Vu 2019 [12], and Zeng 2022 [28], the success of these deep learning approaches comes from a vast dataset. Unfortunately, due to the domain gap issue, using these approaches requires labeling once for each parking lot, which causes enormous manual effort. To solve this issue, TCL2021 [18] proposes using an optical-flow-based motion classifier as the guidance to train the vacant space detector. Specifically, TCL2021 [18] found that RGB-image-based models are affected seriously by camera poses or various lighting conditions. By contrast, the motion classifier, which takes a sequence of optical flow to predict the motion state of a trajectory, is robust to such factors. Therefore, the authors train a vacant space detector using consistency with a flow-based motion classifier. The motion classifier could be trained in a parking lot; later, it helps to train vacant detectors in another parking lot. Using consistency between the two models, the vacant space detector on a new parking lot could be trained without human labor. However, the motion classifier may make a wrong prediction with high confidence; this phenomenon seriously affects the trained detector. Moreover, estimating a consistency reward is complicated, and this method requires high-capacity hardware to store all frames in a trajectory.

Problem Statement
Given a parking lot monitored by a surveillance camera mounted at a high position, we would like to train a vacant space detector to detect the vacant state of any slot in the parking space. However, unlike the conventional methods (Huang 2017 [13] and Vu 2019 [12]), where training and testing images are from the same parking lot, our work uses training images and testing images from different domains. Figure 1 shows the images captured from two domains (different angle views). The domain gap caused by the camera poses results in different appearances of the same parking lot.
Conventionally, a detector must address localization and classification tasks. In a vacant space detector, the localization task could be well addressed by the 3D position of a parking slot. Huang 2017 [13] used a 3D cube to model each slot, and the projection of the cube to a 2D image serves as a localization result. Hence, we can focus on the classification task provided by the local patch. Following a careful design, Vu 2019 [12] introduced a three-space normalization as an input of the classifier. Here, three neighbor slots are concatenated and projected on a 2D image. The projection is normalized as an input of a classifier. Some examples of the normalization samples are shown in Figure 2.
Denote by $D^S = \{(x_i^s, y_i^s)\}_{i=1}^{N}$ a dataset collected on the source domain in Figure 1. We separate it into training and validation datasets ($D^S_{Train}$, $D^S_{Val}$). A detector is trained and validated using these datasets; then, the detector is tested on a testing dataset $D^T_{Test}$. Here, $D^T_{Test}$ is collected on the target domain using the same 3-space-normalization method. Note that $D^S_{Train}$ and $D^T_{Test}$ are collected on different days, and no image in the target domain is used during the training process. If the trained model predicts well on the target dataset, the method has learned generalized features invariant to the camera angle. Some examples of $D^S_{Train}$ and $D^T_{Test}$ are in Figure 2.

Figure 3 presents the overview of the proposed method. First, a pre-trained model extracts a flattened feature $f = \mathrm{CNN}(x) \in \mathbb{R}^D$. The feature is fed forward to an encoder to generate the vectors $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^d$. The two vectors are used to estimate the Kullback-Leibler divergence loss [29]; additionally, a reparameterization process samples a new latent $z$. A classifier takes the latent $z$ and predicts the vacant state $y \in \{\text{vacant}, \text{occupied}\}$. The paper uses VGG16 [30] as a backbone (feature extractor and neck) to provide a 512 × 7 × 7 feature map. However, the extractor could be any pre-trained backbone. In the encoder module, linear layers encode the flattened feature into $\mu$ and $\sigma$. Table 1 summarizes the network architecture.
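The data flow from flattened feature to prediction can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the toy sizes `D = 32` and `d = 4`, the random initialization, the softplus used to keep $\sigma$ positive, and the single-logit sigmoid classifier are all hypothetical choices, since the layer sizes of Table 1 are not reproduced here.

```python
import math
import random


def linear(x, weight, bias):
    """y = W x + b, with the weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softplus(v):
    """Numerically stable log(1 + exp(v)); an assumed way to keep sigma > 0."""
    return math.log1p(math.exp(-abs(v))) + max(v, 0.0)


def sigmoid(v):
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def init_linear(n_out, n_in, rng):
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def forward(f, params, rng):
    """Encoder heads -> reparameterized latent -> classifier probability."""
    mu = linear(f, *params["mu"])
    sigma = [softplus(v) for v in linear(f, *params["sigma"])]
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    logit = linear(z, *params["cls"])[0]
    return mu, sigma, sigmoid(logit)


FLAT_DIM = 512 * 7 * 7   # size of the flattened 512 x 7 x 7 VGG16 feature map
rng = random.Random(0)
D, d = 32, 4             # toy sizes so the sketch runs instantly (hypothetical)
params = {
    "mu": init_linear(d, D, rng),
    "sigma": init_linear(d, D, rng),
    "cls": init_linear(1, d, rng),
}
f = [rng.random() for _ in range(D)]   # stands in for the flattened CNN feature
mu, sigma, p_occupied = forward(f, params, rng)
```

With the VGG16 backbone described above, the flattened feature would have `FLAT_DIM = 25088` entries; the toy sizes only keep the example small.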

Overview of VIB Loss for Vacant Space Detector
Conventionally, a deep neural network uses an extractor to map an input image $x$ to a feature $z$; the feature is fed to a classifier to predict an output state $y$. A generalized feature should be invariant to environmental factors so that the classifier can predict well in the testing phase. In a vacant space detector, learning generalized features is a very challenging task. As shown in Figure 2, the vacant and occupied samples are similar because they share the same background. Both of them include cars in the images; the only difference is the occlusion pattern that occurs when a car is parked in the middle slot. A conventional method may learn many features (e.g., car patterns) and use these patterns to predict the vacant state. These features can work well to predict the vacant state in the same domain but will be affected if the domain changes (e.g., a different angle view). A good feature for vacant space detection is a highly compressed feature that focuses only on the occlusion pattern in the middle slot.
The information bottleneck principle [31] introduces a view of supervised learning that leads to such good features. The principle says that the optimal model transmits as much information as possible from its input to its output through a compressed representation called the information bottleneck. In vacant space detectors, the information bottleneck can be expressed by the following two constraints:
• The latent $z$ must help to predict the output $y$ (vacant state) well;
• The latent $z$ must be a compressed representation of the input $x$ (3-space normalization images). This means that when we know $z$, we cannot infer $x$ very well.
From information theory, mutual information [17] measures the information that one variable carries about another, as in Equation (1). Hence, the above constraints can be expressed as maximizing $I(y; z)$ and minimizing $I(x; z)$. Denote by $I_c$ the information constraint such that $I(x; z) < I_c$; the Lagrangian of the above constrained optimization problem leads to the maximization in Equation (2). Here, $\beta$ is a Lagrange multiplier.
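Equations (1) and (2) are not reproduced in this text. As a hedged reconstruction, assuming the standard definition of mutual information and the usual information bottleneck objective of [19] (with the constant term $\beta I_c$ dropped), they would take the following form:

```latex
\begin{align}
I(x; z) &= \iint p(x, z)\,\log\frac{p(x, z)}{p(x)\,p(z)}\,dx\,dz, \tag{1}\\
\max\; J_{IB} &= I(y; z) - \beta\, I(x; z). \tag{2}
\end{align}
```

A larger $\beta$ puts more weight on compressing $x$, while $\beta = 0$ reduces the objective to maximizing the predictive term alone.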

Classification Loss in VIB
By applying the definition in Equation (1), the first term I(y; z) could be rewritten as Equation (3). Here, p(y | z) is defined by an intractable function as Equation (4).
$$I(y; z) = \iint p(y, z)\,\log\frac{p(y \mid z)}{p(y)}\,dy\,dz = \iint p(y, z)\,\log p(y \mid z)\,dy\,dz - \int p(y)\,\log p(y)\,dy. \tag{3}$$
Since Equation (4) is intractable, the authors of [19] used a lower bound to approximate $p(y \mid z)$. Denote by $q(y \mid z)$ a variational approximation of $p(y \mid z)$; the lower bound is found via the Kullback-Leibler divergence as in Equation (5).
In the resulting lower bound, the entropy of the labels $H(y) = -\int p(y)\,\log p(y)\,dy$ does not depend on the model parameters and can be ignored.
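Equations (4) to (6) are not reproduced in this text. A possible reconstruction, assuming the derivation follows [19] and the Markov chain $y \leftrightarrow x \leftrightarrow z$, is:

```latex
\begin{align}
p(y \mid z) &= \int p(x, y \mid z)\,dx
             = \int \frac{p(y \mid x)\,p(z \mid x)\,p(x)}{p(z)}\,dx, \tag{4}\\
\mathrm{KL}\left[p(y \mid z)\,\|\,q(y \mid z)\right] \ge 0
  &\;\Longrightarrow\;
  \int p(y \mid z)\,\log p(y \mid z)\,dy \;\ge\; \int p(y \mid z)\,\log q(y \mid z)\,dy, \tag{5}\\
I(y; z) &\ge \iint p(y, z)\,\log q(y \mid z)\,dy\,dz + H(y). \tag{6}
\end{align}
```

Substituting (5) into the first term of Equation (3) gives (6), which is why only the term involving $q(y \mid z)$ needs to be optimized.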

Feature Selection Loss in VIB
The second term I(x; z) is rewritten as Equation (7).
Let q(z) be a variational approximation to the marginal p(z). Using the KL divergence, the upper bound of I(x; z) is introduced as Equation (8).
Applying the upper bound to Equation (7), $I(x; z)$ is bounded as in Equation (9).
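Equations (7) to (9) are likewise missing from this text. Under the same assumption that the derivation follows [19], a consistent reconstruction is:

```latex
\begin{align}
I(x; z) &= \iint p(x, z)\,\log\frac{p(z \mid x)}{p(z)}\,dx\,dz
         = \iint p(x, z)\,\log p(z \mid x)\,dx\,dz - \int p(z)\,\log p(z)\,dz, \tag{7}\\
\mathrm{KL}\left[p(z)\,\|\,q(z)\right] \ge 0
  &\;\Longrightarrow\;
  \int p(z)\,\log p(z)\,dz \;\ge\; \int p(z)\,\log q(z)\,dz, \tag{8}\\
I(x; z) &\le \iint p(x)\,p(z \mid x)\,\log\frac{p(z \mid x)}{q(z)}\,dx\,dz. \tag{9}
\end{align}
```

The right-hand side of (9) is the expected KL divergence between the encoder distribution $p(z \mid x)$ and the variational marginal $q(z)$.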

Approximation Loss Function
By applying the lower bound of $I(y; z)$ and the upper bound of $I(x; z)$, the Lagrangian function in Equation (2) is approximated as Equation (10). In a practical application where the dataset $\{(x_i, y_i)\}_{i=1}^{N}$ is available, $J_{IB}$ is estimated as Equation (11).
To model the loss $J_{IB}$ in a deep learning framework, ref. [19] uses an encoder $\Phi$ to predict $\mu$ and $\sigma$. The latent $z$ can be sampled with a reparameterization trick $g(\epsilon, x)$, where $\epsilon \sim p(\epsilon) = \mathcal{N}(0, I)$. Thus, the final objective is to minimize the loss function in Equation (12). Here, $n$ is the number of training samples.
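Equations (10) to (12) are not reproduced in this text. A hedged reconstruction, assuming the formulation of [19] with the label entropy $H(y)$ dropped and the symbol $\mathcal{L}$ introduced here only for illustration, is:

```latex
\begin{align}
J_{IB} &\ge \iiint p(x)\,p(y \mid x)\,p(z \mid x)\,\log q(y \mid z)\,dx\,dy\,dz
          - \beta \iint p(x)\,p(z \mid x)\,\log\frac{p(z \mid x)}{q(z)}\,dx\,dz, \tag{10}\\
J_{IB} &\approx \frac{1}{N}\sum_{i=1}^{N}\left[\int p(z \mid x_i)\,\log q(y_i \mid z)\,dz
          - \beta \int p(z \mid x_i)\,\log\frac{p(z \mid x_i)}{q(z)}\,dz\right], \tag{11}\\
\mathcal{L} &= \frac{1}{n}\sum_{i=1}^{n}\Big(\mathbb{E}_{\epsilon \sim p(\epsilon)}\left[-\log q\big(y_i \mid g(\epsilon, x_i)\big)\right]
          + \beta\,\mathrm{KL}\left[p(z \mid x_i)\,\|\,q(z)\right]\Big). \tag{12}
\end{align}
```

Maximizing the bound in (11) is equivalent to minimizing (12), where the first term is a classification loss and the second term is the KL regularizer weighted by $\beta$.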
Here, the term $q(y \mid g(\epsilon, x))$ serves as a classifier and is learned with a binary cross-entropy loss as in Equation (13), whereas $q(z)$ is the approximated latent marginal, often fixed to a standard normal distribution as in Equation (14).
$$-\log q(y \mid g(\epsilon, x)) = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right], \tag{13}$$
$$q(z) = \mathcal{N}(z; 0, I). \tag{14}$$
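As a minimal per-sample sketch of this objective, the snippet below combines the binary cross-entropy term with the closed-form KL divergence between a diagonal Gaussian and a standard normal. The function names, the label convention (1 for occupied) and the example values are assumptions made for illustration, not the authors' implementation.

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """-[y log(y_hat) + (1 - y) log(1 - y_hat)] for a label y in {0, 1}."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)], summed over the d latent dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0 - math.log(s * s))
               for m, s in zip(mu, sigma))


def vib_loss(y, y_hat, mu, sigma, beta=0.1):
    """Classification term plus beta times the KL regularizer, for one sample."""
    return binary_cross_entropy(y, y_hat) + beta * kl_to_standard_normal(mu, sigma)


# Hypothetical sample: label 1 (assumed to mean "occupied"), prediction 0.5.
loss = vib_loss(y=1, y_hat=0.5, mu=[1.0, 0.0], sigma=[1.0, 1.0], beta=0.1)
```

The KL term vanishes when the encoder outputs exactly the prior ($\mu = 0$, $\sigma = 1$), so any information the latent keeps about the input has to pay for itself through a lower classification loss.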

Datasets and Implementation Detail
Two training datasets are collected on the source domain in Figure 1. The first training dataset $D^S_{Train_1}$ is a standard training dataset collected over 30 days and labeled for supervised learning. We collect one image every 30 min and only in the daytime, from 6 a.m. to 6 p.m. The dataset includes 14,667 vacant slots and 20,021 occupied slots. The second training dataset $D^S_{Train_2}$ includes samples collected from trajectories as in TCL2021 [18]. In detail, TCL2021 [18] uses the magnitude of optical flow to detect a time slot when a car is moving. Given the optical flow in the time slot, a motion classifier is used to estimate the motion state of this car. On both sides (before and after) of the time slot, we have no-motion segments in which no car moves, as in Figure 4. We draw one sample from each no-motion segment. Then, the two samples are added to $D^S_{Train_2}$.

Figure 4. No-motion segments before and after a motion time slot [18]. The X-axis is the frame index; the Y-axis is the optical flow magnitude.
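The pair-sampling idea can be sketched as follows. This is a simplified illustration only: it assumes a single fixed threshold on the per-frame optical flow magnitude and a fixed frame margin, both hypothetical, whereas TCL2021 [18] additionally relies on a motion classifier that is not modeled here.

```python
def find_motion_slot(flow_magnitude, threshold):
    """Return (start, end), inclusive, of the first run of frames above threshold."""
    start = None
    for i, magnitude in enumerate(flow_magnitude):
        if magnitude > threshold and start is None:
            start = i
        elif magnitude <= threshold and start is not None:
            return start, i - 1
    if start is not None:
        return start, len(flow_magnitude) - 1
    return None


def sample_pair(flow_magnitude, threshold, margin=2):
    """Pick one frame before and one frame after the motion slot.

    Returns None if there is no motion or no quiet frame on either side.
    """
    slot = find_motion_slot(flow_magnitude, threshold)
    if slot is None:
        return None
    before, after = slot[0] - margin, slot[1] + margin
    if before < 0 or after >= len(flow_magnitude):
        return None
    if flow_magnitude[before] > threshold or flow_magnitude[after] > threshold:
        return None
    return before, after


# Hypothetical flow magnitudes: quiet, then a car moves, then quiet again.
flow = [0.1, 0.0, 0.2, 3.0, 4.5, 2.8, 0.1, 0.0, 0.1]
pair = sample_pair(flow, threshold=1.0)
```

The two returned frame indices share the same background but straddle the parking event, which is what makes them a vacant/occupied pair for the middle slot.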
In the testing phase, two datasets are prepared. The first dataset $D^S_{Val}$ includes 11,944 vacant samples and 7832 occupied samples collected over another 15 days on the source parking lot from 6 a.m. to 6 p.m. The second dataset $D^T_{Test}$ is of the same scale as the $D^S_{Val}$ dataset, but it is collected from the target domain. We use the two testing datasets to evaluate the effects of the domain gap and the generalization of models trained by the proposed method.
We follow the suggestion in [32] to implement the VIB method. However, instead of using a sum operator to estimate the loss over training patches, we use a mean operator. Given an input x and the corresponding latent z ∈ R d , the sum operator estimates KL loss d times, but the classification loss is only estimated once for each sample. In this situation, gradients from the KL loss may destroy the original backbone. Using the mean operator, the losses are averaged on both the feature and patch size dimensions. Hence, the model can converge smoothly.
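The effect of the two reductions can be illustrated with a small sketch. It is one possible reading of the description above, assuming a closed-form KL term per latent dimension; the latent size `d = 256`, the function names and the toy values are hypothetical.

```python
import math


def kl_per_dimension(mu, sigma):
    """Per-dimension KL[N(mu, sigma^2) || N(0, 1)] for one sample."""
    return [0.5 * (m * m + s * s - 1.0 - math.log(s * s)) for m, s in zip(mu, sigma)]


def reduce_losses(bce_terms, mus, sigmas, reduction):
    """Return (classification term, KL term) under a 'sum' or 'mean' reduction."""
    kl_rows = [kl_per_dimension(mu, sigma) for mu, sigma in zip(mus, sigmas)]
    if reduction == "sum":
        cls_term = sum(bce_terms)
        kl_term = sum(sum(row) for row in kl_rows)
    elif reduction == "mean":
        # Average over the latent dimension first, then over the samples.
        cls_term = sum(bce_terms) / len(bce_terms)
        kl_term = sum(sum(row) / len(row) for row in kl_rows) / len(kl_rows)
    else:
        raise ValueError("reduction must be 'sum' or 'mean'")
    return cls_term, kl_term


d = 256                                      # hypothetical latent size
bce_terms = [math.log(2.0), math.log(2.0)]   # two samples predicted at 0.5
mus = [[1.0] * d, [1.0] * d]
sigmas = [[1.0] * d, [1.0] * d]

sum_cls, sum_kl = reduce_losses(bce_terms, mus, sigmas, "sum")
mean_cls, mean_kl = reduce_losses(bce_terms, mus, sigmas, "mean")
```

Under the sum reduction the KL term grows with `d` while each sample contributes only one classification term, so the regularizer dominates the gradient; the mean reduction keeps the two terms on a comparable scale.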
In addition, the training hyper-parameters are listed below:

Ablation Study
In this section, we discuss the contribution of the KL loss to feature learning. The quality of features depends on the quality of the training dataset. Therefore, we randomly select a subset of $D^S$ … Table 2, and the F1 scores for $D^S_{Val}$ and $D^T_{Test}$ are in Table 3. During the training process, we evaluate the model on $D^S_{Val}$ after every epoch. The accuracy on $D^S_{Val}$ is used to select the best model in the training phase.
Regularization methods (e.g., batch normalization, dropout) are possible solutions to avoid overfitting on a dataset and learn generalized features. Hence, we apply batch normalization and dropout to the conventional supervised learning method to evaluate regularization methods in this application. Our experiment is based on the VGG model to detect vacant spaces. The VGG model includes a feature extractor, a neck, and a classifier. We add a batch normalization layer after the feature extractor and denote this setting as "No VIB (BN)" in Tables 2 and 3. In the classifier module (VGG network), the dropout layer is applied with p = 0.5 by default. Hence, the default setting is denoted as "No VIB (p = 0.5)". To attain better generalized features, we also test the performance when p = 0.7 and p = 0.9. These settings are denoted as "No VIB (p = 0.7)" and "No VIB (p = 0.9)", correspondingly.

Following the results in Tables 2 and 3, several conclusions can be drawn:
• The domain gap is a critical challenge when deploying in a new parking lot. In Table 2, using the $S_2$ dataset (5000 samples), the conventional method (No VIB (p = 0.5)) can achieve 99.46% accuracy on $D^S_{Val}$, but it can only achieve 89.68% accuracy on $D^T_{Test}$;
• To learn generalized features, a vast dataset is needed. In the scenario "No VIB (p = 0.5)" in Table 2, using the $S_5$ dataset (30,540 samples) for training does not help to increase accuracy when evaluating on the same domain ($D^S_{Val}$). However, the performance increases significantly when evaluating on the target domain ($D^T_{Test}$). In detail, when the $S_2$ dataset is used for training, the accuracy on $D^T_{Test}$ is only 89.68%; however, when the $S_5$ dataset is used for training, the corresponding accuracy is 92.02%. A similar observation can be found in Table 3;
• The proposed method helps to increase the performance on the source domain if the training dataset is sparse. When the training set is the $S_1$ dataset (2000 samples), the accuracy and F1 score are 97.95% and 97.54%, correspondingly, on $D^S_{Val}$. When the KL loss (β = 0.1) is introduced, the performance improves to 98.44% for accuracy and 98.14% for the F1 score on the source domain;
• When the number of training samples increases, the proposed method helps to learn better features that work well in the target domain. The conventional model (No VIB (p = 0.5)) can reach up to 88.95% on the $D^T_{Test}$ dataset. However, there are no apparent differences between the accuracies given by the $S_3$ (only 10,000 samples) and $S_5$ datasets. On the target domain, the accuracy given by $S_3$ is 88.84% and the accuracy given by $S_5$ is 88.95%. In contrast, given the $S_5$ dataset, the proposed method can achieve 97.91% accuracy and a 97.31% F1 score (β = 0.1). Moreover, the differences in accuracy become apparent as more training samples are available: the performance improves as the number of training samples increases. In further detail, when $S_1$, $S_2$, $S_3$, $S_4$, and $S_5$ are used for training, the accuracies are 92.93%, 94.73%, 95.28%, 96.51%, and 97.91%, correspondingly;
• The hyper-parameter β should not be too large. When β = 0.2, the performance is reduced on $D^T_{Test}$. This is reasonable because the KL loss forces a feature to be closer to zero. This means that more spatial regions are not used to predict the vacant state. Following our experiment, β = 0.1 is an optimal selection for most datasets;
• Batch normalization may not help in the vacant space detection application. Even if the performance on the source domain is high, the model does not work better on the target domain. The maximum accuracy on the target domain is only 83.53% if BN is applied;
• Increasing the dropout parameter may help with small training sets. If $S_1$ or $S_2$ serves as the training set, the performance improves on both source and target domains. In particular, when the $S_2$ dataset is used for training and the dropout ratio is 0.9, the conventional method performs better than our proposed method. However, when the dataset is larger, the performance does not improve. The accuracies on the $S_2$, $S_3$, $S_4$, and $S_5$ datasets are quite similar for p = 0.5 and p = 0.7. In contrast, our VIB-based method can continue to learn new features from larger training datasets.
We also provide experiments to evaluate the proposed method when the training dataset includes pairwise samples. Here, the pairwise samples in $D^S_{Train_2}$ serve as a contrastive loss during the training phase. The result in Table 4 shows that a pairwise dataset may help to learn better-generalized features. Compared with the result given by the $S_1$ dataset, the result given by $D^S_{Train_2}$ indicates that it does not contain enough data to learn a better detector. This conclusion is consistent with the result in Table 2 when $S_1$ is the training set. However, the performance on the target domain is improved significantly. Without VIB, the accuracy on $D^T_{Test}$ is 85.88%, but it increases to 94.82% when β = 0.1. In this case, the increment is 8.94 percentage points. If the $S_1$ dataset is used as the training set, the increment is 8.75 percentage points. This means that the VIB loss can help with both types of datasets.

In addition, we compare our method with other conventional vacant space detectors, including [12,13,18,22–24,33]. The parking lot in our paper is similar to the parking lot in Vu 2019 [12] and TCL2021 [18]. However, Vu 2019 [12] prepared a vast dataset that includes 587,667 samples, and TCL2021 [18] used a motion classifier to select pairwise samples. The two papers and our paper use a normalized three-parking-slot image as the input of the detectors. The authors of Vu 2019 [12] also used their dataset to evaluate the Paulo 2013 [22], Lixia 2012 [23], Wu 2007 [24], Huang 2017 [13], and Faster-RCNN [33] methods in a comprehensive comparison. From a dataset viewpoint, our dataset is quite similar to the dataset in Vu 2019 [12] but at a smaller scale. From the training process viewpoint, Vu 2019 [12] does not use any pre-trained model, but the proposed method uses the pre-trained VGG model. We also prepared a simple version of the proposed method that does not use VGG as a pre-trained model.
The comparison results in Table 5 show that our method can outperform vision-based methods such as [22–24]. Moreover, we also compared with deep-learning-based methods [12,13,33]. Faster-RCNN [33] works for vehicle detection, but it cannot reliably detect occluded parked cars at small image sizes. In addition, Faster-RCNN needs to address both the localization and classification tasks, while Huang 2017 [13] and Vu 2019 [12] use 3D information to solve the localization task. Hence, their performances are better. Huang 2017 [13] and Vu 2019 [12] can provide higher accuracy for vacant space detection, but they require a vast training dataset with supervised labels. In comparison, our method achieves an equivalent accuracy with only 5000 samples. However, Huang 2017 [13] and Vu 2019 [12] do not use any pre-trained model, whereas our method uses a pre-trained VGG model. For a better comparison, we also train our method from scratch with only 5000 training samples. The result shows that without the help of the pre-trained VGG model, the accuracy is reduced by 1%. This degradation is relatively small and could be compensated for when the training dataset is larger. TCL [18] uses 1000 motion trajectories to achieve a good performance, but its training process is complex because of the task consistency reward. Additionally, it requires more GPU memory to store all frames in a trajectory. Compared with TCL [18], our proposed method is easier to train.

Table 5. Comparison to state-of-the-art vacant detectors on the same parking lot. Here, a sample refers to a normalized 3-parking-slot image and a trajectory refers to a video sequence when a car moves in/out of a slot. The bold represents the best performance in each column.

Comparison with an Upper Bound
In our work, we train and evaluate the vacant space detector on the source domain; additionally, we test it on the target domain. During the training process, no information from the target domain is used. To evaluate our learning method's performance, we compare the proposed method with unsupervised domain adaptation [28]. In unsupervised domain adaptation [28], the target model uses a source dataset and a target dataset during training phases. However, the source dataset includes label and image information, whereas the target dataset only has image information. Therefore, the unsupervised domain adaptation [28] could be treated as an upper bound of our proposed method.
In this experiment, two cameras at two viewing angles are set up to collect two datasets. One dataset is the source dataset and the other is the target dataset. Each dataset is split into a training dataset and a validation dataset in its domain. The results in Table 6 show that when '45° view' is used as the source dataset, the performance on the target dataset ('90° view') reaches the upper bound. However, if '90° view' serves as the source dataset, the performance on '45° view' is far from the upper bound. This means that the quality of the generalized features depends strongly on the source domain.

Feature Analysis
In this section, we analyze the feature maps extracted by the proposed method (with VIB) and the conventional method (without VIB). Our model is based on the VGG model, which includes a feature extractor, a neck, and a classifier. Given an image, we use the feature extractor to extract feature maps. There are 512 feature maps for each input image. Figure 5 shows one example image and several corresponding feature maps in both cases (with and without VIB). The result shows that the VIB-based method extracts sparse feature maps. This suggests that the features learned with VIB are selective: they do not respond everywhere on the example image, even though cars occupy every neighboring slot. Only some spatial locations have a response on the feature map yielded by the VIB-based method. This phenomenon is reasonable because the KL loss forces the feature maps toward zero. Similar to the dropout technique, a sparse feature map may help avoid overfitting to a dataset, so the model generalizes better. This observation also supports the benefit of the proposed method, in which the model can serve as a feature selector that:
• Adaptively selects features suitable for the target task;
• Skips or removes redundant features related to input images.

In addition, the proposed method not only provides sparse feature maps but also produces more empty feature maps, in which all values are zero. Given one feature map, we extract the feature map's minimum, maximum, and mean values. This process is applied to all 512 channels. The statistics of these variables are in Table 7. Each column represents one variable (maximum, minimum, and mean value of a feature map), and each row represents one statistics indicator. By comparing the statistics of the maximum variable between the two methods (with and without VIB), we may see that the number of empty feature maps in the VIB-based method is larger. With VIB, more than 50% of feature maps are empty maps; without VIB, the corresponding value is smaller than 50%.
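The per-channel analysis described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the experiments: the function names (`channel_stats`, `empty_fraction`, `summarize`), the nested-list representation of feature maps, the choice of summary indicators, and the toy data are all assumptions made for the example.

```python
from statistics import mean, median


def channel_stats(feature_maps):
    """Return min, max, and mean for each channel.

    feature_maps: list of C channels, each a 2-D list (H x W) of floats.
    (Hypothetical layout chosen for this sketch.)
    """
    stats = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]  # flatten H x W
        stats.append({"min": min(values), "max": max(values), "mean": mean(values)})
    return stats


def empty_fraction(stats):
    """Fraction of channels that are empty, i.e., all values are zero."""
    empty = sum(1 for s in stats if s["min"] == 0 and s["max"] == 0)
    return empty / len(stats)


def summarize(stats, key):
    """Summary indicators of one variable ('min', 'max', or 'mean') across channels."""
    vals = [s[key] for s in stats]
    return {"min": min(vals), "median": median(vals), "max": max(vals), "mean": mean(vals)}


# Toy example with 4 channels of size 2 x 2 (a real extractor would give 512 channels).
toy = [
    [[0.0, 0.0], [0.0, 0.0]],  # empty map
    [[0.0, 2.0], [0.0, 0.0]],  # sparse map
    [[0.0, 0.0], [0.0, 0.0]],  # empty map
    [[1.0, 3.0], [0.5, 0.0]],  # dense map
]
stats = channel_stats(toy)
frac = empty_fraction(stats)
max_summary = summarize(stats, "max")
print(frac, max_summary)
```

Applying the same three functions to the feature maps of both models (with and without VIB) would give the two sets of numbers that are compared here; in this toy example, half of the channels are empty.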
In addition, the VIB-based features span a narrower range of values. With VIB, the maximum values of the minimum and maximum variables are 7.3204 and 12.6495, respectively. Without VIB, the maximum values of the minimum and maximum variables are 1.2250 and 30.8568, respectively. Considering these maximum values, the gap between the maximum and minimum variables is 5.3291 with VIB (12.6495 - 7.3204) and 29.6318 without VIB (30.8568 - 1.2250). This suggests that VIB acts as a normalization process that reduces the feature variance in feature space.
Following the above discussion, VIB has feature normalization and sparse representation properties. Normalization and sparse representation are well-known means of improving model generalization. Hence, the proposed method can learn generalized features. In addition, during the training process, some uncertainty is added to the features by the reparameterization step. This uncertainty may model, in feature space, some domain shift factors (orientation, camera field of view, camera height). Therefore, the model is better able to adapt to environmental change.
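To make the two mechanisms discussed above concrete, the following is a minimal sketch of a reparameterization step and a KL term in plain Python. It assumes a diagonal Gaussian encoder with mean `mu` and log-variance `log_var`, and a standard normal prior; these choices, the function names, and the toy values are assumptions for illustration and are not taken from the model described in this paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, with eps ~ N(0, 1), element-wise.

    The noise eps is the injected uncertainty; sigma = exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over feature dimensions.

    Assumes a standard normal prior (an assumption of this sketch).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


rng = random.Random(0)          # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 0.0]           # toy encoder means
log_var = [0.0, -2.0, 0.0]      # toy encoder log-variances

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
kl_at_prior = kl_to_standard_normal([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
print(z, kl, kl_at_prior)
```

Under these assumptions, the KL term is zero only when the mean is zero and the variance is one, so minimizing it pulls feature means toward zero, which is consistent with the sparse and empty feature maps observed with VIB. The sampled noise in `reparameterize` is what perturbs the features during training.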

Conclusions
In this paper, we apply VIB [19] to learn generalized features for vacant space detectors. The training and validation datasets are from the source domain, while the testing dataset is from the target domain. Our method is significantly better than conventional methods when the detector is tested on the target domain. Additionally, our method helps improve performance on the source domain when the training dataset is small.